103 research outputs found

    Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm

    Full text link
    The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However there are different interpretations in the literature and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.Comment: 20 pages, 21 citations, 4 figure

    MaxMin Linear Initialization for Fuzzy C-Means

    Get PDF
    International audienceClustering is an extensive research area in data science. The aim of clustering is to discover groups and to identify interesting patterns in datasets. Crisp (hard) clustering considers that each data point belongs to one and only one cluster. However, it is inadequate as some data points may belong to several clusters, as is the case in text categorization. Thus, we need more flexible clustering. Fuzzy clustering methods, where each data point can belong to several clusters, are an interesting alternative. Yet, seeding iterative fuzzy algorithms to achieve high quality clustering is an issue. In this paper, we propose a new linear and efficient initialization algorithm MaxMin Linear to deal with this problem. Then, we validate our theoretical results through extensive experiments on a variety of numerical real-world and artificial datasets. We also test several validity indices, including a new validity index that we propose, Transformed Standardized Fuzzy Difference (TSFD)

    Commonality Preserving Multiple Instance Clustering Based on Diverse Density

    Full text link
    Abstract. Image-set clustering is a problem decomposing a given im-age set into disjoint subsets satisfying specied criteria. For single vector image representations, proximity or similarity criterion is widely applied, i.e., proximal or similar images form a cluster. Recent trend of the im-age description, however, is the local feature based, i.e., an image is described by multiple local features, e.g., SIFT, SURF, and so on. In this description, which criterion should be employed for the clustering? As an answer to this question, this paper presents an image-set clus-tering method based on commonality, that is, images preserving strong commonality (coherent local features) form a cluster. In this criterion, image variations that do not affect common features are harmless. In the case of face images, hair-style changes and partial occlusions by glasses may not affect the cluster formation. We dened four commonality mea-sures based on Diverse Density, that are used in agglomerative clustering. Through comparative experiments, we conrmed that two of our meth-ods perform better than other methods examined in the experiments.

    Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm

    Full text link
    Over the past five decades, k-means has become the clustering algorithm of choice in many application domains primarily due to its simplicity, time/space efficiency, and invariance to the ordering of the data points. Unfortunately, the algorithm's sensitivity to the initial selection of the cluster centers remains to be its most serious drawback. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have time complexity superlinear in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the ordering of the data points. These methods are generally unreliable in that the quality of their results is unpredictable. Therefore, it is common practice to perform multiple runs of such methods and take the output of the run that produces the best results. Such a practice, however, greatly increases the computational requirements of the otherwise highly efficient k-means algorithm. In this chapter, we investigate the empirical performance of six linear, deterministic (non-random), and order-invariant k-means initialization methods on a large and diverse collection of data sets from the UCI Machine Learning Repository. The results demonstrate that two relatively unknown hierarchical initialization methods due to Su and Dy outperform the remaining four methods with respect to two objective effectiveness criteria. In addition, a recent method due to Erisoglu et al. performs surprisingly poorly.Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196

    Genomic distance entrained clustering and regression modelling highlights interacting genomic regions contributing to proliferation in breast cancer

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Genomic copy number changes and regional alterations in epigenetic states have been linked to grade in breast cancer. However, the relative contribution of specific alterations to the pathology of different breast cancer subtypes remains unclear. The heterogeneity and interplay of genomic and epigenetic variations means that large datasets and statistical data mining methods are required to uncover recurrent patterns that are likely to be important in cancer progression.</p> <p>Results</p> <p>We employed ridge regression to model the relationship between regional changes in gene expression and proliferation. Regional features were extracted from tumour gene expression data using a novel clustering method, called genomic distance entrained agglomerative (GDEC) clustering. Using gene expression data in this way provides a simple means of integrating the phenotypic effects of both copy number aberrations and alterations in chromatin state. We show that regional metagenes derived from GDEC clustering are representative of recurrent regions of epigenetic regulation or copy number aberrations in breast cancer. Furthermore, detected patterns of genomic alterations are conserved across independent oestrogen receptor positive breast cancer datasets. Sequential competitive metagene selection was used to reveal the relative importance of genomic regions in predicting proliferation rate. The predictive model suggested additive interactions between the most informative regions such as 8p22-12 and 8q13-22.</p> <p>Conclusions</p> <p>Data-mining of large-scale microarray gene expression datasets can reveal regional clusters of co-ordinate gene expression, independent of cause. By correlating these clusters with tumour proliferation we have identified a number of genomic regions that act together to promote proliferation in ER+ breast cancer. Identification of such regions should enable prioritisation of genomic regions for combinatorial functional studies to pinpoint the key genes and interactions contributing to tumourigenicity.</p

    Leveraging structure determination with fragment screening for infectious disease drug targets: MECP synthase from Burkholderia pseudomallei

    Get PDF
    As part of the Seattle Structural Genomics Center for Infectious Disease, we seek to enhance structural genomics with ligand-bound structure data which can serve as a blueprint for structure-based drug design. We have adapted fragment-based screening methods to our structural genomics pipeline to generate multiple ligand-bound structures of high priority drug targets from pathogenic organisms. In this study, we report fragment screening methods and structure determination results for 2C-methyl-D-erythritol-2,4-cyclo-diphosphate (MECP) synthase from Burkholderia pseudomallei, the gram-negative bacterium which causes melioidosis. Screening by nuclear magnetic resonance spectroscopy as well as crystal soaking followed by X-ray diffraction led to the identification of several small molecules which bind this enzyme in a critical metabolic pathway. A series of complex structures obtained with screening hits reveal distinct binding pockets and a range of small molecules which form complexes with the target. Additional soaks with these compounds further demonstrate a subset of fragments to only bind the protein when present in specific combinations. This ensemble of fragment-bound complexes illuminates several characteristics of MECP synthase, including a previously unknown binding surface external to the catalytic active site. These ligand-bound structures now serve to guide medicinal chemists and structural biologists in rational design of novel inhibitors for this enzyme

    Structure of a Burkholderia pseudomallei Trimeric Autotransporter Adhesin Head

    Get PDF
    Pathogenic bacteria adhere to the host cell surface using a family of outer membrane proteins called Trimeric Autotransporter Adhesins (TAAs). Although TAAs are highly divergent in sequence and domain structure, they are all conceptually comprised of a C-terminal membrane anchoring domain and an N-terminal passenger domain. Passenger domains consist of a secretion sequence, a head region that facilitates binding to the host cell surface, and a stalk region.Pathogenic species of Burkholderia contain an overabundance of TAAs, some of which have been shown to elicit an immune response in the host. To understand the structural basis for host cell adhesion, we solved a 1.35 A resolution crystal structure of a BpaA TAA head domain from Burkholderia pseudomallei, the pathogen that causes melioidosis. The structure reveals a novel fold of an intricately intertwined trimer. The BpaA head is composed of structural elements that have been observed in other TAA head structures as well as several elements of previously unknown structure predicted from low sequence homology between TAAs. These elements are typically up to 40 amino acids long and are not domains, but rather modular structural elements that may be duplicated or omitted through evolution, creating molecular diversity among TAAs.The modular nature of BpaA, as demonstrated by its head domain crystal structure, and of TAAs in general provides insights into evolution of pathogen-host adhesion and may provide an avenue for diagnostics

    SAD phasing using iodide ions in a high-throughput structural genomics environment

    Get PDF
    The Seattle Structural Genomics Center for Infectious Disease (SSGCID) focuses on the structure elucidation of potential drug targets from class A, B, and C infectious disease organisms. Many SSGCID targets are selected because they have homologs in other organisms that are validated drug targets with known structures. Thus, many SSGCID targets are expected to be solved by molecular replacement (MR), and reflective of this, all proteins are expressed in native form. However, many community request targets do not have homologs with known structures and not all internally selected targets readily solve by MR, necessitating experimental phase determination. We have adopted the use of iodide ion soaks and single wavelength anomalous dispersion (SAD) experiments as our primary method for de novo phasing. This method uses existing native crystals and in house data collection, resulting in rapid, low cost structure determination. Iodide ions are non-toxic and soluble at molar concentrations, facilitating binding at numerous hydrophobic or positively charged sites. We have used this technique across a wide range of crystallization conditions with successful structure determination in 16 of 17 cases within the first year of use (94% success rate). Here we present a general overview of this method as well as several examples including SAD phasing of proteins with novel folds and the combined use of SAD and MR for targets with weak MR solutions. These cases highlight the straightforward and powerful method of iodide ion SAD phasing in a high-throughput structural genomics environment

    Determination of genetic structure of germplasm collections: are traditional hierarchical clustering methods appropriate for molecular marker data?

    Get PDF
    Despite the availability of newer approaches, traditional hierarchical clustering remains very popular in genetic diversity studies in plants. However, little is known about its suitability for molecular marker data. We studied the performance of traditional hierarchical clustering techniques using real and simulated molecular marker data. Our study also compared the performance of traditional hierarchical clustering with model-based clustering (STRUCTURE). We showed that the cophenetic correlation coefficient is directly related to subgroup differentiation and can thus be used as an indicator of the presence of genetically distinct subgroups in germplasm collections. Whereas UPGMA performed well in preserving distances between accessions, Ward excelled in recovering groups. Our results also showed a close similarity between clusters obtained by Ward and by STRUCTURE. Traditional cluster analysis can provide an easy and effective way of determining structure in germplasm collections using molecular marker data, and, the output can be used for sampling core collections or for association studies
    corecore